Big Data: GeoScience part 3

MSDE

Lviv University

Spatial relationships

Overview

Question: how the data layers are located in relation to each other?

  • finding out if a certain point is located inside an area
  • or whether a line intersects with another line or a polygon

Note

  • These kind of queries are commonly called as spatial queries.
  • Spatial queries are conducted based on the topological spatial relations
    • contains
    • touches
    • intersects

Topological spatial relations

DE-9IM

  • There are different ways to compute these queries
  • Most GIS software rely on Dimensionally Extended 9-Intersection Model (DE-9IM).
  • DE-9IM defines the topological relations based on the interior, boundary, and exterior of two geometric shapes and how they intersect with each other.
  • DE-9IM also considers the dimensionality of the objects.
    • The Point objects are 0-dimensional
    • LineString and LinearRing are 1-dimensional
    • Polygon objects are 2-dimensional.

Topological spatial relations

Interior, boundary and exterior for different geometric data types. The data types can be either 0, 1 or 2-dimensional.

Topological spatial relations

Spatial predicates

When testing how two geometries relate to each other, the DE-9IM model gives a result which is called spatial predicate (also called as binary predicate).

  • intersects,
  • within,
  • contains,
  • overlaps
  • touches

There are plenty of topological relations: altogether 512 with 2D data.

Topological spatial relations

Eight common spatial predicates formed based on spatial relations between two geometries. Modified after Egenhofer et al. (1992).

Topological spatial relations

Types

  • When the geometries have at least one point in common, the geometries are said to be intersecting with each other.
  • When two geometries touch each other, they have at least one point in common (at the border in this case), but their interiors do not intersect with each other.
  • When the interiors of the geometries A and B are partially on top of each other and partially outside of each other, the geometries are overlapping with each other.
  • The spatial predicate for covers is when the interior of geometry B is almost totally within A, but they share at least one common coordinate at the border.
  • Similarly, if geometry A is almost totally contained by the geometry B (except at least one common coordinate at the border) the spatial predicate is called covered by.

Topological spatial relations

Making spatial queries in Python

In Python, all the basic spatial predicates are available from shapely library, including:

  • .intersects()
  • .within()
  • .contains()
  • .overlaps()
  • .touches()
  • .covers()
  • .covered_by()
  • .equals()
  • .disjoint()
  • .crosses()

Topological spatial relations

Making spatial queries in Python: within()

Create a couple of Point objects and one Polygon object which we can use to test how they relate to each other:

from shapely import Point, Polygon

# Create Point objects
point1 = Point(24.952242, 60.1696017)
point2 = Point(24.976567, 60.1612500)

# Create a Polygon
coordinates = [
    (24.950899, 60.169158),
    (24.953492, 60.169158),
    (24.953510, 60.170104),
    (24.950958, 60.169990),
]
polygon = Polygon(coordinates)

Topological spatial relations

Making spatial queries in Python

We can check the contents of the new variables by printing them to the screen, for example, in which case we would see

print(point1)
print(point2)
print(polygon)
POINT (24.952242 60.1696017)
POINT (24.976567 60.16125)
POLYGON ((24.950899 60.169158, 24.953492 60.169158, 24.95351 60.170104, 24.950958 60.16999, 24.950899 60.169158))

If you want to test whether these Point geometries stored in point1 and point2 are within the polygon, you can call the .within() method as follows:

point1.within(polygon)
True
point2.within(polygon)
False

Topological spatial relations

Making spatial queries in Python

One of the most common spatial queries is to see if a geometry intersects or touches another one. Again, there are binary operations in shapely for checking these spatial relationships:

  • .intersects() - Two objects intersect if the boundary or interior of one object intersect in any way with the boundary or interior of the other object.
  • .touches() - Two objects touch if the objects have at least one point in common and their interiors do not intersect with any part of the other object.

Topological spatial relations

Making spatial queries in Python

Let’s try these by creating two LineString geometries and test whether they intersect and touch each other:

from shapely import LineString, MultiLineString

# Create two lines
line_a = LineString([(0, 0), (1, 1)])
line_b = LineString([(1, 1), (0, 2)])
line_a.intersects(line_b)
True
line_a.touches(line_b)
True

Topological spatial relations

Making spatial queries in Python

As we can see, it seems that our two LineString objects are both intersecting and touching each other. We can confirm this by plotting the features together as a MultiLineString:

# Create a MultiLineString from line_a and line_b
multi_line = MultiLineString([line_a, line_b])
multi_line

Topological spatial relations

Making spatial queries in Python

If the lines are fully overlapping with each other they don’t touch due to the spatial relationship rule in the DE-9IM. We can confirm this by checking if line_a touches itself:

line_a.touches(line_a)
False

No it doesn’t. However, .intersects() and .equals() should produce True for a case when we compare the line_a with itself:

print("Intersects?", line_a.intersects(line_a))
print("Equals?", line_a.equals(line_a))
Intersects? True
Equals? True

Topological spatial relations

Making spatial queries in Python

The following prints results for all predicates between the point1 and the polygon which we created earlier:

print("Intersects?", point1.intersects(polygon))
print("Within?", point1.within(polygon))
print("Contains?", point1.contains(polygon))
print("Overlaps?", point1.overlaps(polygon))
print("Touches?", point1.touches(polygon))
print("Covers?", point1.covers(polygon))
print("Covered by?", point1.covered_by(polygon))
print("Equals?", point1.equals(polygon))
print("Disjoint?", point1.disjoint(polygon))
print("Crosses?", point1.crosses(polygon))
Intersects? True
Within? True
Contains? False
Overlaps? False
Touches? False
Covers? False
Covered by? True
Equals? False
Disjoint? False
Crosses? False

Topological spatial relations

within vs contains

  • if you have many points and just one polygon and you try to find out which one of them is inside the polygon: You might need to check the separately for each point to see which one is .within() the polygon.
  • if you have many polygons and just one point and you want to find out which polygon contains the point: You might need to check separately for each polygon to see which one(s) .contains() the point.

Spatial relations

Spatial queries using geopandas

Let’s check which points are located within specific areas of Helsinki:

import geopandas as gpd

points = gpd.read_file("_data/Helsinki/addresses.shp")
districts = gpd.read_file("_data/Helsinki/Major_districts.gpkg")
print("Shape:", points.shape)
print(points.head())
Shape: (34, 1)
                    geometry
0   POINT (24.91556 60.1632)
1  POINT (24.93166 60.16905)
2  POINT (24.94168 60.16996)
3  POINT (24.97865 60.19005)
4  POINT (24.92151 60.15662)
print("Shape:", districts.shape)
print(districts.tail(5))
Shape: (23, 3)
           Name Description                                           geometry
18    Koivukylä              POLYGON Z ((24.99423 60.33296 0, 25.00007 60.3...
19      Itäinen              POLYGON Z ((25.03517 60.23627 0, 25.03585 60.2...
20  Östersundom              POLYGON Z ((25.23352 60.25655 0, 25.23744 60.2...
21     Hakunila              POLYGON Z ((25.08472 60.27248 0, 25.08492 60.2...
22        Korso              POLYGON Z ((25.1238 60.34191 0, 25.11997 60.34...

The data contains 34 address points and 23 district polygons.

Spatial relations

Spatial queries using geopandas

Let’s find all points that are within two areas in Helsinki region, namely Itäinen and Eteläinen (‘Eastern’ and ‘Southern’ in English).

selection = districts.loc[districts["Name"].isin(["Itäinen", "Eteläinen"])]
print(selection.head())
         Name Description                                           geometry
10  Eteläinen              POLYGON Z ((24.78277 60.09997 0, 24.81973 60.1...
19    Itäinen              POLYGON Z ((25.03517 60.23627 0, 25.03585 60.2...

Spatial relations

Spatial queries using geopandas

Let’s now plot the layers on top of each other. The areas with red color represent the districts that we want to use for testing the spatial relationships against the point layer (shown with blue color):

base = districts.plot(facecolor="gray")
selection.plot(ax=base, facecolor="red")
points.plot(ax=base, color="blue", markersize=5)

Spatial relations

Spatial queries using geopandas

We should check which Points in the points GeoDataFrame are within the selected polygons stored in the selection geodataframe.

We will use a spatial join as an efficient way to conduct spatial queries in geopandas.

selected_points = points.sjoin(selection.geometry.to_frame(), predicate="within")
ax = districts.plot(facecolor="gray")
ax = selection.plot(ax=ax, facecolor="red")
ax = selected_points.plot(ax=ax, color="gold", markersize=2)

Spatial relations

to_frame()

Using to_frame() allows us to avoid attaching any extra attributes from the selection geodataframe to our data, which is what .sjoin() method would normally do (and which it is actually designed for). .geometry.to_frame() will:

  • first select the geometry column from the selection layer
  • and then convert it into a GeoDataFrame (which would otherwise be a GeoSeries).

An alternative approach for doing the same thing is to use selection[[selection.active_geometry_name]], which also returns a GeoDataFrame containing only a column with the geodataframe’s active geometry.

Spatial relations

sjoin() examples

By default, the .sjoin() uses "intersects" as a spatial predicate, but it is easy to change this. For example, we can investigate which of the districts contain at least one point.

districts_with_points = districts.sjoin(
    points.geometry.to_frame(), predicate="contains"
)
ax = districts.plot(facecolor="gray")
ax = districts_with_points.plot(ax=ax, edgecolor="gray")
ax = points.plot(ax=ax, color="red", markersize=2)

Spatial relations

sjoin() examples

To find all possible spatial predicates for a given GeoDataFrame you can call:

districts.sindex.valid_query_predicates
{None,
 'contains',
 'contains_properly',
 'covered_by',
 'covers',
 'crosses',
 'dwithin',
 'intersects',
 'overlaps',
 'touches',
 'within'}

What is .sindex?

districts.sindex
<geopandas.sindex.SpatialIndex at 0x1216f7dd0>

Spatial relations

sindex

SpatialIndex object contains the spatial index for our data.

  • A spatial index is a special data structure that allows for efficient querying of spatial data.
  • geopandas uses a spatial index called R-tree which is a hierarchical, tree-like structure that divides the space into nested, overlapping rectangles and indexes the bounding boxes of each geometry.
  • Hence, when selecting data based on topological relations, we recommend using .sjoin() instead of directly calling .within(), .contains() that come with the shapely geometries (as shown previously).

Spatial relations

Exercise

How many addresses are located in each district? You can find out the answer by grouping the spatial join result based on the district name.

Spatial join

Description

Spatial join retrieves table attributes from one layer and transfers them into another layer based on their spatial relationship. For example, one can join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that intersects with the point.

It is good to remember that spatial join is always conducted between two layers at a time.

Spatial join

Spatial join allows you to combine attribute information from multiple layers based on spatial relationship.

Spatial join

sjoin details

In spatial join, there are two set of options that you can control, which ultimately influence how the data is transferred between the layers:

  1. How the spatial relationship between geometries should be checked (i.e. spatial predicates), and
  2. What type of table join you want to conduct (inner, left, or right outer join)

Spatial predicates

  • The spatial predicates control how the spatial relationship between the geometries in the two data layers is checked.
  • Only those cases where the spatial predicate returns True will be kept in the result.

Spatial join

Spatial join

sjoin type

The other parameter that you can use to control how the spatial join is conducted is the spatial join type. There are three different join types that influence the outcome of the spatial join:

  1. inner join
  2. left outer join
  3. right outer join

Spatial join

Spatial join with Python

Spatial join can be done easily with geopandas using the .sjoin() method. Next, we will learn how to use this method to perform a spatial join between two layers:

  1. addresses which are the locations that we geocoded previously;
  2. population grid which is a 250m x 250m grid polygon layer that contains population information from the Helsinki Region (source: Helsinki Region Environmental Services Authority).

Spatial join

Spatial join with Python

Let’s start by reading the data:

import geopandas as gpd

addr_fp = "_data/Helsinki/addresses.shp"
addresses = gpd.read_file(addr_fp)
addresses.head(2)
geometry
0 POINT (24.91556 60.1632)
1 POINT (24.93166 60.16905)

As we can see, the addresses variable contains address Points which represent a selection of public transport stations in the Helsinki Region.

Spatial join

Spatial join with Python

pop_grid_fp = "_data/Helsinki/Population_grid_2021_HSY.gpkg"
pop_grid = gpd.read_file(pop_grid_fp)
pop_grid.head(2)
id inhabitants occupancy_rate geometry
0 Vaestotietoruudukko_2021.1 5 50.60 POLYGON ((25472499.995 6689749.005, 25472499.9...
1 Vaestotietoruudukko_2021.2 7 36.71 POLYGON ((25472499.995 6685998.998, 25472499.9...

The pop_grid dataset contains few columns, namely a unique id, the number of inhabitants per grid cell, and the occupancy_rate as percentage.

Spatial join

Preparations for spatial join

The basic requirement for a successful spatial join is that the layers should overlap with each other in space. If the geometries between the layers do not share the same CRS, it is very likely that the spatial join will fail and produces an empty GeoDataFrame.

print("Address points CRS:", addresses.crs)
print("Population grid CRS:", pop_grid.crs.name)
Address points CRS: None
Population grid CRS: ETRS89 / GK25FIN

To fix this issue, let’s reproject the geometries in the addresses GeoDataFrame to the same CRS as pop_grid using the .to_crs() method.

# Reproject
addresses = addresses.set_crs(epsg=4326)
addresses = addresses.to_crs(crs=pop_grid.crs)

# Validate match
addresses.crs == pop_grid.crs
True
addresses.head(2)
geometry
0 POINT (25495311.608 6672258.695)
1 POINT (25496206.216 6672909.016)

Spatial join

Preparations for spatial join

As a last preparatory step, let’s visualize both datasets on top of each other to see how the inhabitants are distributed over the region, and how the address points are located in relation to the grid:

Spatial join

Join the layers based on spatial relationship

The aim here is to get information about

How many people live in a given polygon that contains an individual address-point?

Thus, we want to join the attribute information from the pop_grid layer into the addresses Point layer using the .sjoin() method.

join = addresses.sjoin(pop_grid, predicate="within", how="inner")
join
geometry index_right id inhabitants occupancy_rate
0 POINT (25495311.608 6672258.695) 3262 Vaestotietoruudukko_2021.3263 505 14.01
1 POINT (25496206.216 6672909.016) 3381 Vaestotietoruudukko_2021.3382 172 27.67
2 POINT (25496762.723 6673010.538) 3504 Vaestotietoruudukko_2021.3505 43 61.44
3 POINT (25498815.415 6675246.744) 3845 Vaestotietoruudukko_2021.3846 757 33.98
4 POINT (25495641.151 6671525.076) 3310 Vaestotietoruudukko_2021.3311 1402 29.76
5 POINT (25504528.607 6680282.118) 5058 Vaestotietoruudukko_2021.5059 283 20.51
6 POINT (25506082.816 6678702.5) 5407 Vaestotietoruudukko_2021.5408 155 43.42
7 POINT (25501631.384 6685110.943) 4302 Vaestotietoruudukko_2021.4303 236 37.62
8 POINT (25501586.789 6683452.706) 4309 Vaestotietoruudukko_2021.4310 253 33.56
10 POINT (25496896.718 6673162.114) 3504 Vaestotietoruudukko_2021.3505 43 61.44
11 POINT (25493506.585 6679793.504) 2985 Vaestotietoruudukko_2021.2986 629 36.11
12 POINT (25493093.193 6680589.107) 2886 Vaestotietoruudukko_2021.2887 817 29.63
13 POINT (25497118.126 6678784.269) 3535 Vaestotietoruudukko_2021.3536 163 33.08
14 POINT (25500717.139 6682045.953) 4114 Vaestotietoruudukko_2021.4115 99 33.06
15 POINT (25494134.034 6678278.391) 3079 Vaestotietoruudukko_2021.3080 425 37.04
16 POINT (25492575.043 6681972.681) 2775 Vaestotietoruudukko_2021.2776 353 32.35
17 POINT (25498089.949 6679640.055) 3728 Vaestotietoruudukko_2021.3729 255 38.51
19 POINT (25492287.623 6679040.234) 2723 Vaestotietoruudukko_2021.2724 231 33.98
20 POINT (25499646.711 6681218.844) 3945 Vaestotietoruudukko_2021.3946 336 35.39
22 POINT (25504344.005 6677452.185) 5008 Vaestotietoruudukko_2021.5009 453 21.73
23 POINT (25507495.008 6677173.504) 5626 Vaestotietoruudukko_2021.5627 231 35.18
24 POINT (25504169.174 6679129.384) 4942 Vaestotietoruudukko_2021.4943 638 33.41
25 POINT (25506082.506 6680577.96) 5399 Vaestotietoruudukko_2021.5400 361 24.29
26 POINT (25497840.365 6675020.358) 3696 Vaestotietoruudukko_2021.3697 829 33.92
27 POINT (25501570.947 6675734.321) 4330 Vaestotietoruudukko_2021.4331 271 37.29
28 POINT (25500377.732 6675098.523) 4079 Vaestotietoruudukko_2021.4080 321 38.40
29 POINT (25497199.344 6674065.972) 3552 Vaestotietoruudukko_2021.3553 215 43.71
30 POINT (25496286.911 6672913.768) 3411 Vaestotietoruudukko_2021.3412 286 25.86
31 POINT (25496128.995 6672625.288) 3382 Vaestotietoruudukko_2021.3383 1409 32.37
32 POINT (25495624.409 6671766.187) 3309 Vaestotietoruudukko_2021.3310 995 37.07
33 POINT (25497062.747 6673226.621) 3555 Vaestotietoruudukko_2021.3556 281 36.14

Spatial join

Join the layers based on spatial relationship

  • We received information about inhabitants and occupancy_rate
  • We got columns index_right and id_right which tell the index and id of the matching polygon in the right-side member of the spatial join
  • The id column in the left-side member of the spatial join was renamed as id_left.
  • The suffices _left and _right are appended to the column names to differentiate the columns in cases where there are identical column names present in both GeoDataFrames.

Spatial join

Join the layers based on spatial relationship

Let’s also visualize the joined output. In the following, we plot the points using the inhabitants column to indicate the color:

ax = join.plot(
    column="inhabitants",
    cmap="Reds",
    markersize=15,
    scheme="quantiles",
    legend=True,
    figsize=(10, 6),
)
ax.set_title("Amount of inhabitants living close to the point");

Spatial join

Join the layers based on spatial relationship

It is useful to investigate if we lost any data while doing the spatial join. Let’s check this by comparing the number of rows in our result to how many addresses we had originally:

len(addresses) - len(join)
3

We can investigate where the points outside of polygons are located:

m = pop_grid.explore(color="blue", style_kwds=dict(color="blue", stroke=False))
addresses.explore(m=m, color="red")
Make this Notebook Trusted to load map: File -> Trust Notebook

Spatial join

Join the layers based on spatial relationship

We might want to keep the information for the points that did not get a match based on the spatial relationship.

left_join = addresses.sjoin(pop_grid, predicate="within", how="left")
left_join
geometry index_right id inhabitants occupancy_rate
0 POINT (25495311.608 6672258.695) 3262.0 Vaestotietoruudukko_2021.3263 505.0 14.01
1 POINT (25496206.216 6672909.016) 3381.0 Vaestotietoruudukko_2021.3382 172.0 27.67
2 POINT (25496762.723 6673010.538) 3504.0 Vaestotietoruudukko_2021.3505 43.0 61.44
3 POINT (25498815.415 6675246.744) 3845.0 Vaestotietoruudukko_2021.3846 757.0 33.98
4 POINT (25495641.151 6671525.076) 3310.0 Vaestotietoruudukko_2021.3311 1402.0 29.76
5 POINT (25504528.607 6680282.118) 5058.0 Vaestotietoruudukko_2021.5059 283.0 20.51
6 POINT (25506082.816 6678702.5) 5407.0 Vaestotietoruudukko_2021.5408 155.0 43.42
7 POINT (25501631.384 6685110.943) 4302.0 Vaestotietoruudukko_2021.4303 236.0 37.62
8 POINT (25501586.789 6683452.706) 4309.0 Vaestotietoruudukko_2021.4310 253.0 33.56
9 POINT (25492851.783 6678869.234) NaN NaN NaN NaN
10 POINT (25496896.718 6673162.114) 3504.0 Vaestotietoruudukko_2021.3505 43.0 61.44
11 POINT (25493506.585 6679793.504) 2985.0 Vaestotietoruudukko_2021.2986 629.0 36.11
12 POINT (25493093.193 6680589.107) 2886.0 Vaestotietoruudukko_2021.2887 817.0 29.63
13 POINT (25497118.126 6678784.269) 3535.0 Vaestotietoruudukko_2021.3536 163.0 33.08
14 POINT (25500717.139 6682045.953) 4114.0 Vaestotietoruudukko_2021.4115 99.0 33.06
15 POINT (25494134.034 6678278.391) 3079.0 Vaestotietoruudukko_2021.3080 425.0 37.04
16 POINT (25492575.043 6681972.681) 2775.0 Vaestotietoruudukko_2021.2776 353.0 32.35
17 POINT (25498089.949 6679640.055) 3728.0 Vaestotietoruudukko_2021.3729 255.0 38.51
18 POINT (25496358.8 6676198.28) NaN NaN NaN NaN
19 POINT (25492287.623 6679040.234) 2723.0 Vaestotietoruudukko_2021.2724 231.0 33.98
20 POINT (25499646.711 6681218.844) 3945.0 Vaestotietoruudukko_2021.3946 336.0 35.39
21 POINT (25501142.787 6681208.443) NaN NaN NaN NaN
22 POINT (25504344.005 6677452.185) 5008.0 Vaestotietoruudukko_2021.5009 453.0 21.73
23 POINT (25507495.008 6677173.504) 5626.0 Vaestotietoruudukko_2021.5627 231.0 35.18
24 POINT (25504169.174 6679129.384) 4942.0 Vaestotietoruudukko_2021.4943 638.0 33.41
25 POINT (25506082.506 6680577.96) 5399.0 Vaestotietoruudukko_2021.5400 361.0 24.29
26 POINT (25497840.365 6675020.358) 3696.0 Vaestotietoruudukko_2021.3697 829.0 33.92
27 POINT (25501570.947 6675734.321) 4330.0 Vaestotietoruudukko_2021.4331 271.0 37.29
28 POINT (25500377.732 6675098.523) 4079.0 Vaestotietoruudukko_2021.4080 321.0 38.40
29 POINT (25497199.344 6674065.972) 3552.0 Vaestotietoruudukko_2021.3553 215.0 43.71
30 POINT (25496286.911 6672913.768) 3411.0 Vaestotietoruudukko_2021.3412 286.0 25.86
31 POINT (25496128.995 6672625.288) 3382.0 Vaestotietoruudukko_2021.3383 1409.0 32.37
32 POINT (25495624.409 6671766.187) 3309.0 Vaestotietoruudukko_2021.3310 995.0 37.07
33 POINT (25497062.747 6673226.621) 3555.0 Vaestotietoruudukko_2021.3556 281.0 36.14

Spatial join

Join the layers based on spatial relationship

Let’s investigate a bit more to see which rows did not have a matching polygon in the population grid.

left_join.loc[left_join["inhabitants"].isnull()]
geometry index_right id inhabitants occupancy_rate
9 POINT (25492851.783 6678869.234) NaN NaN NaN NaN
18 POINT (25496358.8 6676198.28) NaN NaN NaN NaN
21 POINT (25501142.787 6681208.443) NaN NaN NaN NaN

Spatial join

Exercise

Do the spatial join another way around, i.e. make a spatial join where you join information from the address points into the population grid. How does the result differ from the version where we joined information from the grids to the points? What would be the benefit of doing the join this way around?

Nearest neighbour analysis

Description

  • a fundamental concept in geographic data analysis and modelling.
  • many of these techniques rely on the idea that proximity in geographic space typically indicates also similarity in attribute space.

First Law of Geography

Everything is related to everything, but near things are more related than distant things.

Nearest neighbour analysis

The basic idea of finding a nearest neighbour based on geographic distance.

Nearest neighbour analysis

Why limit?

  • computational limitations
  • we have specific reasoning that makes it sensible to limit the search area.

Nearest neighbour analysis in Python

Overview

Typical tasks:

  • find the nearest neighbors for all Point geometries in a given GeoDataFrame based on Pointobjects from another GeoDataFrame
  • find nearest neighbor between two Polygon datasets
  • use scipy library to find K-Nearest Neighbors (KNN) with Point data.

Nearest neighbour analysis in Python

Practical example

Where is the closest public transport stop from my place of residence?

  • Our aim is to search for each building point in the Helsinki Region the closest public transport stop.
  • In geopandas, we can find nearest neighbors for all geometries in a given GeoDataFrame using the .sjoin_nearest() method.
import geopandas as gpd
import matplotlib.pyplot as plt

stops = gpd.read_file("_data/Helsinki/pt_stops_helsinki.gpkg")
building_points = gpd.read_file("_data/Helsinki/building_points_helsinki.zip")

print("Number of stops:", len(stops))
stops.head(2)
Number of stops: 8377
stop_name stop_lat stop_lon stop_id geometry
0 Ritarihuone 60.16946 24.95667 1010102 POINT (24.95667 60.16946)
1 Kirkkokatu 60.17127 24.95657 1010103 POINT (24.95657 60.17127)
print("Number of buildings:", len(building_points))
building_points.head(2)
Number of buildings: 158731
name geometry
0 None POINT (24.85584 60.20727)
1 Uimastadion POINT (24.93045 60.18882)

Nearest neighbour analysis in Python

Practical example: visualize

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 10))

# Plot buildings
building_points.plot(ax=ax1, markersize=0.2, alpha=0.5)
ax1.set_title("Buildings")

# Plot stops
stops.plot(ax=ax2, markersize=0.2, alpha=0.5, color="red")
ax2.set_title("Stops");

Nearest neighbour analysis in Python

sjoin_nearest

  • merges in a similar way to .sjoin() method
  • however, in this case the method is actually searching for the closest geometries instead of relying on spatial predicates, such as within
  • the method uses a spatial index called STRTree which is an efficient implementation of the R-tree dynamic index structure for spatial searching.

Nearest neighbour analysis in Python

Reproject

Let’s start by reprojecting our latitude and longitude values into a metric system using the national EUREF-FIN coordinate reference system (EPSG code 3067) for Finland:

stops = stops.to_crs(epsg=3067)
building_points = building_points.to_crs(epsg=3067)

stops.head(2)
stop_name stop_lat stop_lon stop_id geometry
0 Ritarihuone 60.16946 24.95667 1010102 POINT (386623.301 6672037.884)
1 Kirkkokatu 60.17127 24.95657 1010103 POINT (386623.991 6672239.572)

Nearest neighbour analysis in Python

Use sjoin_nearest

Because we are interested to find the closest stop geometry for each building, the buildings GeoDataFrame is on the left hand side of the command.

Note that we give a name for a column which is used to store information about the distance between a given building and the closest stop (this is optional):

%time
closest = building_points.sjoin_nearest(stops, distance_col="distance")
closest
CPU times: user 0 ns, sys: 1 μs, total: 1 μs
Wall time: 1.19 μs
name geometry index_right stop_name stop_lat stop_lon stop_id distance
0 None POINT (381166.6 6676424.438) 1131 Muusantori 60.20749 24.857450 1304138 92.679893
1 Uimastadion POINT (385236.565 6674238.472) 467 Auroran sairaala 60.19145 24.925540 1171122 400.243370
2 None POINT (386317.478 6672100.648) 61 Senaatintori 60.16901 24.950460 1020450 109.819633
3 Hartwall Arena POINT (385225.109 6676120.56) 532 Veturitie 60.20661 24.929680 1174112 104.632434
4 Talli POINT (385079.733 6676989.745) 496 Posti 1 60.21345 24.917550 1172143 472.248282
... ... ... ... ... ... ... ... ...
158726 None POINT (373896.051 6677472.204) 4621 Samis 60.21369 24.720970 3170209 195.675552
158727 None POINT (372425.65 6676945.528) 4654 Yrjö Liipolan tie 60.20922 24.695470 3170244 137.137640
158728 None POINT (374696.625 6677972.738) 4655 Kandidaatintie 60.21818 24.736987 3170245 135.341745
158729 Granhultsskolan POINT (373287.582 6677731.639) 4624 Uimahalli 60.21638 24.711260 3170212 99.408108
158730 Kauniaisten kirkko POINT (374112.695 6677330.017) 4665 Postitori 60.21267 24.728250 3170257 67.790422

159818 rows × 8 columns

Nearest neighbour analysis in Python

sjoin_nearest

  • The last column in the table shows the distance in meters between a given building and the closest stop.
  • The distance is only returned upon request as we did by specifying distance_col="distance".
  • The column index_right provides information about the index number of the closest stop in the stops GeoDataFrame.
  • number of rows in our result has actually increased, because for some geometries in the buildings GeoDataFrame, the distance between the building and two (or more) stops have been exactly the same (i.e. they are equidistant).
  • In such cases, the sjoin_nearest() will store both records into the results by duplicating the building information and attaching information from the stops into separate rows accordingly.

Nearest neighbour analysis in Python

sjoin_nearest

We can make computations faster by specifying a max_distance parameter that specifies the maximum search distance:

%time
closest_limited = building_points.sjoin_nearest(
    stops, max_distance=100, distance_col="distance"
)
closest_limited
CPU times: user 1e+03 ns, sys: 1 μs, total: 2 μs
Wall time: 1.91 μs
name geometry index_right stop_name stop_lat stop_lon stop_id distance
0 None POINT (381166.6 6676424.438) 1131 Muusantori 60.207490 24.857450 1304138 92.679893
10 None POINT (384645.078 6669763.917) 592 Hernesaaren laituri 60.148287 24.923281 1204101 57.786201
12 None POINT (384782.782 6669707.017) 595 Hernesaaren laituri 60.148680 24.924240 1204108 79.844881
13 None POINT (384714.47 6669710.887) 592 Hernesaaren laituri 60.148287 24.923281 1204101 32.640335
16 None POINT (385040.806 6670639.517) 596 Hernesaarenkatu 60.156110 24.930370 1204109 87.888087
... ... ... ... ... ... ... ... ...
158718 None POINT (374219.973 6677006.1) 4651 Kauniaisten asema 60.210830 24.729330 3170240 69.803673
158719 None POINT (374231.494 6676967.402) 4642 Kauniaistentie 60.209810 24.731510 3170230 63.384115
158720 None POINT (374602.815 6677396.18) 4673 Raamattuopisto 60.213524 24.736685 3170265 56.594370
158729 Granhultsskolan POINT (373287.582 6677731.639) 4624 Uimahalli 60.216380 24.711260 3170212 99.408108
158730 Kauniaisten kirkko POINT (374112.695 6677330.017) 4665 Postitori 60.212670 24.728250 3170257 67.790422

40128 rows × 8 columns

Nearest neighbour analysis in Python

Connect with a line

  • we need to merge also the Point geometries from the other layer into our results, which can then be used to create a LineString connecting the points to each other.
  • for merging the closest stop geometries into our results, we can take advantage of the index_right column in our table and conduct a normal table join using the .merge() method
  • we only keep the geometry columns from the stops GeoDataFrame because all the other attributes already exist in our results:
closest = closest.merge(
    stops[[stops.active_geometry_name]], left_on="index_right", right_index=True
)
closest.head()
name geometry_x index_right stop_name stop_lat stop_lon stop_id distance geometry_y
0 None POINT (381166.6 6676424.438) 1131 Muusantori 60.20749 24.85745 1304138 92.679893 POINT (381256.66 6676446.317)
1 Uimastadion POINT (385236.565 6674238.472) 467 Auroran sairaala 60.19145 24.92554 1171122 400.243370 POINT (384973.331 6674539.973)
2 None POINT (386317.478 6672100.648) 61 Senaatintori 60.16901 24.95046 1020450 109.819633 POINT (386277.25 6671998.462)
3 Hartwall Arena POINT (385225.109 6676120.56) 532 Veturitie 60.20661 24.92968 1174112 104.632434 POINT (385255.784 6676220.595)
4 Talli POINT (385079.733 6676989.745) 496 Posti 1 60.21345 24.91755 1172143 472.248282 POINT (384607.679 6677003.267)

Nearest neighbour analysis in Python

Connect with a line

  • In order to create a connecting LineString between the buildings and the closest stops, we can use the linestrings() function of the shapely library
  • To extract the point coordinates from the Point objects stored in the geometry_x and geometry_y columns, we use the .get_coordinates() method of geopandas that returns the x and y coordinates as Series objects/columns.
  • Then we convert these into numpy arrays using the to_numpy() method which we pass to the linestrings() function.
  • Finally, we store the resulting LineStrings into a column geometry which we set as the active geometry of the GeoDataFrame:
from shapely import linestrings

closest["geometry"] = linestrings(
    closest.geometry_x.get_coordinates().to_numpy(),
    closest.geometry_y.get_coordinates().to_numpy(),
)

closest = closest.set_geometry("geometry")
closest.head()
name geometry_x index_right stop_name stop_lat stop_lon stop_id distance geometry_y geometry
0 None POINT (381166.6 6676424.438) 1131 Muusantori 60.20749 24.85745 1304138 92.679893 POINT (381256.66 6676446.317) LINESTRING (381166.6 381256.66, 6676424.438 66...
1 Uimastadion POINT (385236.565 6674238.472) 467 Auroran sairaala 60.19145 24.92554 1171122 400.243370 POINT (384973.331 6674539.973) LINESTRING (385236.565 384973.331, 6674238.472...
2 None POINT (386317.478 6672100.648) 61 Senaatintori 60.16901 24.95046 1020450 109.819633 POINT (386277.25 6671998.462) LINESTRING (386317.478 386277.25, 6672100.648 ...
3 Hartwall Arena POINT (385225.109 6676120.56) 532 Veturitie 60.20661 24.92968 1174112 104.632434 POINT (385255.784 6676220.595) LINESTRING (385225.109 385255.784, 6676120.56 ...
4 Talli POINT (385079.733 6676989.745) 496 Posti 1 60.21345 24.91755 1172143 472.248282 POINT (384607.679 6677003.267) LINESTRING (385079.733 384607.679, 6676989.745...

Nearest neighbour analysis in Python

Connect with a line

Let’s create a map that visualizes the buildings, stops and the connecting lines between the buildings and the closest stops in a single figure:

ax = closest.plot(lw=0.5, color="blue", figsize=(10, 10))
ax = building_points.plot(ax=ax, color="red", markersize=2)
ax = stops.plot(ax=ax, color="black", markersize=8.5, marker="s")
# Zoom to specific area
ax.set_xlim(382000, 384100)
ax.set_ylim(6676000, 6678000);

Nearest neighbour analysis in Python

Connect with a line

It is easy to do some descriptive analysis based on this data providing information about levels of access to public transport in the region:

closest["distance"].describe()
count    159818.000000
mean        229.029606
std         292.348698
min           0.743490
25%          99.771301
50%         163.805853
75%         260.461391
max        7698.270635
Name: distance, dtype: float64

Nearest neighbour analysis in Python

Nearest neighbors with Polygon and LineString data

Let’s find the closest urban park to building polygons in a neighborhood called Kamppi, which is located in Helsinki, Finland. Then, we’ll find the closest drivable road (LineString) to each building.

import geopandas as gpd

buildings = gpd.read_file("_data/Helsinki/Kamppi_buildings.gpkg")
parks = gpd.read_file("_data/Helsinki/Kamppi_parks.gpkg")
roads = gpd.read_file("_data/Helsinki/Kamppi_roads.gpkg")
buildings
osmid building name geometry
0 11711721042 yes Nice Bike Pyörähuolto POINT (384966.661 6671503.786)
1 8035238 public Lasipalatsi POLYGON ((385459.65 6672184.469, 385456.356 66...
2 8042297 yes Radisson Blu Royal POLYGON ((385104.154 6671916.693, 385101.584 6...
3 14797170 school None POLYGON ((384815.326 6671762.71, 384815.792 66...
4 14797171 yes None POLYGON ((384797.759 6671853.253, 384798.253 6...
... ... ... ... ...
450 8092998 yes None POLYGON ((384747.465 6671811.996, 384744.27 66...
451 8280536 apartments None POLYGON ((384839.007 6671934.815, 384839.485 6...
452 8525159 civic None POLYGON ((385495.275 6672164.009, 385494.928 6...
453 8525161 civic None POLYGON ((385486.225 6672173.653, 385486.717 6...
454 8535506 civic None POLYGON ((385481.13 6672167.861, 385482.372 66...

455 rows × 4 columns

Nearest neighbour analysis in Python

Nearest neighbors with Polygon and LineString data

# Plot buildings, parks and roads
ax = buildings.plot(color="gray", figsize=(10, 10))
ax = parks.plot(ax=ax, color="green")
ax = roads.plot(ax=ax, color="red")

Nearest neighbour analysis in Python

Nearest neighbors with Polygon and LineString data

We find the nearest park for each building Polygon and store the distance into the column distance:

nearest_parks = buildings.sjoin_nearest(parks, distance_col="distance")
nearest_parks
osmid_left building name_left geometry index_right osmid_right leisure name_right distance
0 11711721042 yes Nice Bike Pyörähuolto POINT (384966.661 6671503.786) 12 1227991181 park Kaartin lasaretin puisto 100.208527
1 8035238 public Lasipalatsi POLYGON ((385459.65 6672184.469, 385456.356 66... 1 8042613 park Simonpuistikko 16.284929
2 8042297 yes Radisson Blu Royal POLYGON ((385104.154 6671916.693, 385101.584 6... 8 37390082 park None 40.039501
3 14797170 school None POLYGON ((384815.326 6671762.71, 384815.792 66... 5 26999855 park None 0.000000
4 14797171 yes None POLYGON ((384797.759 6671853.253, 384798.253 6... 5 26999855 park None 14.873403
... ... ... ... ... ... ... ... ... ...
450 8092998 yes None POLYGON ((384747.465 6671811.996, 384744.27 66... 5 26999855 park None 70.819624
451 8280536 apartments None POLYGON ((384839.007 6671934.815, 384839.485 6... 8 37390082 park None 38.574646
452 8525159 civic None POLYGON ((385495.275 6672164.009, 385494.928 6... 1 8042613 park Simonpuistikko 32.792083
453 8525161 civic None POLYGON ((385486.225 6672173.653, 385486.717 6... 1 8042613 park Simonpuistikko 90.919207
454 8535506 civic None POLYGON ((385481.13 6672167.861, 385482.372 66... 1 8042613 park Simonpuistikko 87.821936

455 rows × 9 columns

print("Maximum distance:", nearest_parks["distance"].max().round(0))
print("Average distance:", nearest_parks["distance"].mean().round(0))
Maximum distance: 229.0
Average distance: 61.0

Nearest neighbour analysis in Python

Nearest neighbors with Polygon and LineString data

In a similar manner, we can also find the nearest road from each building as follows:

nearest_roads = buildings.sjoin_nearest(roads, distance_col="distance")
nearest_roads
osmid_left building name_left geometry index_right osmid_right name_right highway distance
0 11711721042 yes Nice Bike Pyörähuolto POINT (384966.661 6671503.786) 182 [126894680, 126894676, 126894678, 126894679] Eerikinkatu residential 11.181066
0 11711721042 yes Nice Bike Pyörähuolto POINT (384966.661 6671503.786) 24 [126894680, 126894676, 126894678, 126894679] Eerikinkatu residential 11.181066
1 8035238 public Lasipalatsi POLYGON ((385459.65 6672184.469, 385456.356 66... 15 [42574048, 42574049, 28920739, 77891210, 26999... Arkadiankatu secondary 52.015824
1 8035238 public Lasipalatsi POLYGON ((385459.65 6672184.469, 385456.356 66... 33 [42574048, 42574049, 28920739, 77891210, 26999... Arkadiankatu secondary 52.015824
2 8042297 yes Radisson Blu Royal POLYGON ((385104.154 6671916.693, 385101.584 6... 83 [37135576, 8035726, 37135575] Salomonkatu residential 6.659959
... ... ... ... ... ... ... ... ... ...
452 8525159 civic None POLYGON ((385495.275 6672164.009, 385494.928 6... 107 51707742 Yrjönkatu residential 88.553223
453 8525161 civic None POLYGON ((385486.225 6672173.653, 385486.717 6... 15 [42574048, 42574049, 28920739, 77891210, 26999... Arkadiankatu secondary 90.569914
453 8525161 civic None POLYGON ((385486.225 6672173.653, 385486.717 6... 33 [42574048, 42574049, 28920739, 77891210, 26999... Arkadiankatu secondary 90.569914
454 8535506 civic None POLYGON ((385481.13 6672167.861, 385482.372 66... 15 [42574048, 42574049, 28920739, 77891210, 26999... Arkadiankatu secondary 96.128437
454 8535506 civic None POLYGON ((385481.13 6672167.861, 385482.372 66... 33 [42574048, 42574049, 28920739, 77891210, 26999... Arkadiankatu secondary 96.128437

703 rows × 9 columns

Nearest neighbour analysis in Python

Nearest neighbors with Polygon and LineString data

To better understand how the spatial join between the buildings and roads have been conducted, we can again visualize the nearest neighbors with a straight line. To do this, we first bring the geometries from the roads GeoDataFrame into the same table with the buildings:

nearest_roads = nearest_roads.merge(
    roads[["geometry"]], left_on="index_right", right_index=True
)
nearest_roads.head(3)
osmid_left building name_left geometry_x index_right osmid_right name_right highway distance geometry_y
0 11711721042 yes Nice Bike Pyörähuolto POINT (384966.661 6671503.786) 182 [126894680, 126894676, 126894678, 126894679] Eerikinkatu residential 11.181066 LINESTRING (385040.141 6671566.384, 385034.832...
0 11711721042 yes Nice Bike Pyörähuolto POINT (384966.661 6671503.786) 24 [126894680, 126894676, 126894678, 126894679] Eerikinkatu residential 11.181066 LINESTRING (384942.149 6671500.856, 384950.743...
1 8035238 public Lasipalatsi POLYGON ((385459.65 6672184.469, 385456.356 66... 15 [42574048, 42574049, 28920739, 77891210, 26999... Arkadiankatu secondary 52.015824 LINESTRING (385285.226 6672266.801, 385296.799...

Nearest neighbour analysis in Python

Nearest neighbors with Polygon and LineString data

We can use shortest_line() from the shapely library that returns a LineString object between the input geometries showing the shortest distance between them.

from shapely import shortest_line


# Generate LineString between nearest points of two geometries
connectors = nearest_roads.apply(
    lambda row: shortest_line(row["geometry_x"], row["geometry_y"]), axis=1
)

# Create a new GeoDataFrame out of these geometries
connectors = gpd.GeoDataFrame({"geometry": connectors}, crs=roads.crs)
connectors["distance"] = connectors.length
connectors.head()
geometry distance
0 LINESTRING (384966.661 6671503.786, 384960.444... 11.181066
0 LINESTRING (384966.661 6671503.786, 384960.444... 11.181066
1 LINESTRING (385487.966 6672217.975, 385460.972... 52.015824
1 LINESTRING (385487.966 6672217.975, 385460.972... 52.015824
2 LINESTRING (385050.507 6671936.92, 385046.795 ... 6.659959

Nearest neighbour analysis in Python

Nearest neighbors with Polygon and LineString data

We can visualize the buildings, roads and these connectors to better understand the exact points where the distance between a given building and the closest road is shortest:

m = buildings.explore(color="gray", tiles="CartoDB Positron")
m = roads.explore(m=m, color="red")
m = connectors.explore(m=m, color="green")
m
Make this Notebook Trusted to load map: File -> Trust Notebook

Nearest neighbour analysis in Python

Exercise

  • What is the closest road to each park?
  • Use the parks and roads GeoDataFrames and follow the approaches presented above to find the closest road to each park.
  • What is the highest (maximum) distance between parks and roads present in our datasets?